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This  report  describes  algorithms,  performance,  applications, 
and  user  information  associated  with  a  code  which  solves  a  memory- 
resident  single  banded  symmetric  matrix  equation  on  the  CRAY-1. 

The  code  is  available  as  part  of  a  library  of  CAL-coded 
equation-solvers,  TliH-. 

^  PREFACE 

'The  mathematical  software  described  herein  is  the  result  of 
experimental  research  on  vector  algorithms  for  the  direct  solution 
of  finite  element  grids  arising  in  structural  analysis.  It  re¬ 
presents  what  is  thought  to  be  the  best  compromise  between  vector- 
izability,  sparsity  exploitation,  and  user  convenience  for  such 
problems  for  the  CRaY-1. 


I.  Introduction 


When  direct  methods  are  used  in  the  solution  of  equations 
associated  with  2-D  finite  element  systems,  the  majority  of  pro¬ 
duction  codes  require  a  "frontal"  approach  [1],  i.e.,  the  finite 
elements  are  assembled  and  reduced  in  batches  along  a  front  that 
moves  across  the  grid.  This  procedure  saves  storage  of  the  entire 
profile  matrix  and  so  conserves  memory,  a  major  issue  for  the  scalar 
scientific  processors  of  the  1970's  with  fast  storage  often  less 
than  100,000  words.  Only  relatively  small  research  problems  can 
be  completely  assembled  and  then  completely  solved  in  main  memory. 
The  principal  difficulty  with  a  frontal  solution  is  programming 
complexity  due  to  solution  partitioning  and  to  I/O  management. 

In  contrast,  vector  processors  with  one-  and  two-megaword 
storage  permit  the  memory-resident  solution  of  the  larger  problems 
commensurate  with  their  speed.  Roughly,  matrices  associated  with 
square  grids  three  times  larger  on  a  side  can  now  be  solved  by  a 
vector  processor,  in  the  same  computation  time  and  at  the  storage 
limit  of  main  memory. 

previous  study  [2]  indicated  that,  for  unsymmetric  matrices, 
profile  solution  was  marginally  faster  than  banded  solution  on  the 
CRAY-1.  For  this  small  speedup,  significant  preprocessing  is  re¬ 
quired  to  block  the  profile  structure.  It  was  not  considered  worth¬ 
while  to  produce  a  symmetric  version  of  the  block  profile  solution. 


II.  Symme trie  Banded  Solution 


A.  Basic  algorithm 

Consider  an  nxn  symmetric  banded  matrix  A,  with  half-bandwj dth 
m.  The  solution  of 

AX  =  B 

is  performed  in  two  steps,  viz,  (1)  triangular  factorization 

A  =  UTDU 

where  D  is  a  diagonal  matrix,  U  is  an  upper  triangular  matrix,  and 
T 

U  is  the  transpose  of  U,  and  (2)  forward  and  backward  substitution 

T  -1 

Yx  =  (UA)  XB 


x  =  u_1y2 


Asymptotically  in  n,  factorization  of  a  symmetric  matrix  requires 
1/2  the  computation  of  an  unsymmetric  matrix;  the  substitution  steps 
require  the  same  computation. 

The  performance  of  the  algorithm  depends  on  the  vector  length 
for  small  bandwidths  and  on  the  data  flow  between  the  vector 
registers  and  main  memory  for  all  bandwidths.  Mathematically,  the 
average  vector  length  is  restricted  to  1/2  that  of  the  unsymmetric 
case.  The  data  flow  is  minimized  -  and  performance  optimized  - 
when  accumulation  is  made  into  a  single  row  or  column  from  pre¬ 
viously-factored  rows  and  columns;  a  poor  algorithm  creating 
excessive  data  flow  would  be  the  common  one  based  on  an  outer 
product  of  a  row  or  column. 

A  column-oriented  accumulation  suggested  by  Jordan  [3]  and 
modified  in  [2]  could  be  used  with  reduced  success  in  the  symmetric 

case,  due  to  the  halved  vector  length.  However,  the  accumulation  ] 

i 
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kernel  discussed  in  [2]  suffers,  for  small  bandwidths,  from  being 
instruction-bound  and,  for  large  bandwidths,  from  a  shift  operation 
in  the  chained  sequence  of  vector  inner loop  instructions.  In 
contrast,  the  following  algorithm  kernel  consists  of  simply  a 
vector-matrix  multiply,  which  can  be  made  quite  efficient. 

The  accumulation  is  represented  as  being  made  into  a  row  rather 
than  a  column  (Figure  1) .  The  product  of  the  row  vector  to  the  left 
of  the  main  diagonal  and  the  triangular  matrix  above  the  accumulant 
to  the  right  of  the  main  diagonal  is  then  added  to  the  accumulant 
row.  However,  this  simple  kernel  is  complicated  by  the  need  to  main¬ 
tain  components  of  both  U  and  DU  or  UD. 

The  organization  of  the  accumulation  kernel  has  the  following 
form  (see  Figure  la) .  Let 

Y  be  the  accumulant  row 

R  be  the  row  vector,  initially  stored  as  a  column  of 
DU  above  the  pivot 

C  (current  column)  be  a  column  of  U  to  be  computed 
from  and  stored  over  R 

D  be  an  appropriate  segment  of  the  diagonal,  before 
the  current  pivot  position 

M  be  the  triangular  matrix  DU  above  Y  and  including  R 
then  the  reduction  kernel  has  the  form 

T  <• -  D-1R 

Y  < -  Y  +  TM 

C  <s -  T 

The  current  column  C  is  a  column  of  U;  the  accumulant  Y  is  a  row 
of  DU.  T  is  a  temporary  vector  and  resides  in  a  vector  register. 

An  illustrative  Fortran  program  incorporating  this  algorithm 
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is  given  in  Table  1.  This  will  be  exercised  later  to  obtain 
performance  comparisons  with  the  more  efficient  assembly  code  to 


follow. 


B.  Partitioned  Solution 

When  the  half-bandwidth  is  greater  than  64,  so  that  vector 
lengths  must  be  truncated,  it  remains  efficient  to  preserve  the 
concept  of  a  vector-matrix  multiply.  The  oversized  matrix  is  then 
partitioned  intobandedge  and  interior  matrices,  all  of  maximum  dim¬ 
ension  64.  Figure  lb  illustrates  this  partitioning  process  in  both 
the  row  and  column  directions.  The  circled  numbers .  .(3)  represent 
the  nesting  level  of  the  computation,  with ^  being  the  innermost 
loop.  Loops  3  and  relate  the  order  of  the  block  processing;  this 
ordering  is  also  represented  by  the  circled  letters  a. . ,f. 

Two  vector-matrix  multiply  kernels  are  now  required;  one  for 
triangular  bandedge  matrices  and  one  for  rectangular  (interior) 
matrices.  The  later  kernels  can  achieve  very  high  performance  with 
short  vector  lengths.  For  vector  lengths  (VL)  greater  than  8,  the 
execution  rates  are  given  by 

M  FLOPS  =  160  (y^g) 

i.e.,  a  rate  of  80  MFLOPS  for  VL  =  8  and  142  MFLOPS  for  VL  =  64. 


bandedge 


(a)  Overview 


(b)  Partitioned  View 

Figure  1.  Organization  and  terminology  of  a  reduction  step 


•;****  ILLUSTRATIVE  SYMMETRIC  BANDED  EQUATION  SCLVER 
C  UNFARTITIO ft E D ;  F.CW  STORAGE  ONLY 

DIMENSION  A(IOOO),  B( 1 00 ) ,  TEMP(IOO) 

C**«*  M  IS  HALF-BANDWIDTH;  N  IS  NUMBER  OF  EQUATIONS 
10  READ  (5,20 ,END=90)  M,  N 
20  FORMAT  (2110) 

MP1  =  M  +  1 
M2  =  2  *  M  +  1 
DO  30  I  =  1  ,  N 
30  B( I )  =  0. 

C***«  FORMULATE  EQUATIONS  AND  RHS  TO  HAVE  SOLUTION  B(J)=J 
DO  50  I  =  1  ,  MP1 


50 

J  = 

1  , 

N 

IX 

=  I 

+  (J 

- 

1 )  *  MP1 

A( IX )  = 

0. 

Ml 

=  MAXO ( 1 

,M 

-1+2) 

IF 

(J  . 

GE. 

Ml) 

A(IX)  = 

—  1  • 

IF 

(I  . 

EQ. 

MP1 

)  A( IX)  : 

:  M2 

IY 

=  I 

+  J 

-  M 

-  1 

IF 

(IY 

.GT. 

0 

.AND.  I  , 

.NE.  MP1 ) 

B(  J )  =  B. 

IF 

(IY 

.GT. 

0) 

B( IY )  = 

B(  IY)  + 

(J)  *  A( IX ) 

50  CONTINUE 

C***«  TRIANGULARLY  FACTOR  MATRIX 
CALL  FACTORU,  TEMP,  M,  N) 

C»#**  FORWARD  AND  BACK  SUBSTITUTE 
CALL  SOLVE( A,  B,  M,  N) 

DO  60  J  =  1 ,  N 
AJ  =  J 

IF  (ABS(B(J)  -  AJ)  .GT.  1.E-6)  GO  TO  70 
60  CONTINUE 
GO  TO  10 

70  WRITE  (6,80)  J,  (B(I),I=1,N) 

80  FORMAT  ('  FIRST  WRONG  SOLUTION  VARIABLE  IS',  I5/(5E12.4)) 
90  STOP 
END 

SUBROUTINE  FACTORU,  TEMP,  M,  N) 

DIMENSION  TEMP(1),  A( 1 ) 

MP1  =  M  +  1 

A( MP 1  )  =  1  .EO  /  A(  MP1 ) 

DO  50  J  =  2,  N 
JM1  =  J  -  1 
Ml  =  MAXO ( 1 , J  -  M) 

ID  =  Ml  *  MP1  -  MP 1 

IX  r  Ml  +  J  *  M  -  1 

CDIR$  IVDEP 

DO  10  I  =  Ml ,  JM1 
IX  =  IX  +  1 

ID  =  ID  +  MP 1 

10  TEMP ( I )  =  A( IX )  *  A( ID) 


Table  1.  Simplified  Fortran  version  of  code 
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DO  30  I  =  Ml ,  JM1 

IX  =  I  -  J  +  J  *  MP1 
US  =  TFMP(I) 

IZ  =  MI NO ( N , M  +  I) 
IL=I+J*M-M 
LJ  =  J  *  MP1  -  M 
CDIR$  IVDEP 

DO  20  L  =  J,  IZ 
LJ  =  LJ  +  M 
IL  =  IL  +  M 

20  A(  LJ )  =  A(  LJ  )  -  A(  IL )  *  US 

30  CONTINUE 

IX  =  J  »  M  +  Ml  -  1 
CDI R$  IVDEP 

DO  40  I  =  Ml ,  JM1 
IX  =  IX  +  1 
40  A( IX )  =  TEMP ( I ) 

JJ  =  J  *  MP 1 
50  A( JJ)  =  1 .EO  /  A( JJ) 

RETURN 

END 

SUBROUTINE  SOLVE(A,  B,  M,  N) 
DIMENSION  A( 1 )  ,  B( 1 ) 

MP1  =  M  +  1 

NM1  =  N  -  1 

DO  10  I  =  1 ,  NM1 

IP1  =1+1 
IL  =  I  *  MP 1 
Ml  =  MI NO ( N, I  +  M) 

CDIR$  IVDEP 

DO  10  L  =  IP1 ,  Ml 
IL  =  IL  +  M 

10  B(L)  =  B ( L )  -  A( IL )  *  B( I ) 
CDIR$  IVDEP 
II  =  0 

DO  20  I  =  1  ,  N 
II  =  II  +  MP 1 
20  B(I)  =  B( I )  *  A( II ) 

DO  30  L  =  1 ,  NM1 
LL  =  N  -  L  +  1 
LLM1  =  LL  -  1 
Ml  =  MAXO ( 1 , LL  -  M) 

ILL  =  Ml  +  LL  *  M  -  1 
CDIR$  IVDEP 

DO  30  I  =  Ml ,  LLM1 
ILL  =  ILL  +  1 

30  B ( I )  =  B(  I )  -  A(  ILL)  *  B(  LL) 
RETURN 
END 


Table  1.  Continued 


V 


III.  Software  Description 
A.  Storage  Options 

It  is  common  to  store  the  diagonal  and  the  U  matrix  in  com- 

* 


pressed  form  in  an  array  of  dimension  ty* (M+l)  .  Figure  2  illus¬ 
trates  eight  possible  regularly-addressed  storage  patterns;  in 
each  case,  u^  may  be  replaced  by  f  _  to  represent  the  storage  of 

m 

L(=U')  rather  than  U.  Fortunately,  all  of  these  cases  may  be  ac¬ 
commodated  by  defining  suitable  parameters  of  the  argument  lists 
of  the  following  routines.  The  key  to  the  generality  is  the  pass¬ 
ing  of  the  (1,1)  position  of  the  matrix  rather  than  the  first  ele¬ 
ment  of  the  matrix  storage  array;  all  indexing  is  then  performed 
off  of  this  base. 


B.  Calling  Sequences 
Factorization 


CALL  SDANF  (N , M , A (Nil ) , ND IAG , NDROW) 


where 

N  is  the  number  of  equations 

M  is  the  half-bandwidth  (not  including  the 

diagonal ) 

A(N11)  is  the  (1,1)  element  of  the  matrix 

NDIAG  is  the  storage  increment  between  successive 
diagonal  elements 

NDROW  is  the  storage  increment  between  successive 
column  elements 

Substitution 

CALL  SEANS  (N ,M, A (Nil ), NDIAG , NDROW, Y) 


★ 

See  following  discussion  for  symbol  definitions 


*  ***** 
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* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

dl 

u12 

U13 

U14 

U14 

U13 

U12 

dl 

d2 

u23 

u24 

U2S 

u25 

U24 

U23 

d2 

d3 

u34 

U35 

u36 

U36 

u35 

U35 

d3 

d4 

u45 

u46 

* 

* 

U46 

u4  5 

d4 

d5 

u56 

* 

* 

* 

* 

U56 

d. 

D 

d6 

* 

* 

* 

* 

* 

* 

d6 

Nil  -1,-NDROW—  (N-l)  ; 

NDCOL»N;NDIAG»l 

(C)  Nll«N*M+l>NDROW«N- 

NDCOL— N ;  NDIAG-1 

dl 

* 

* 

* 

* 

* 

it 

dl 

d2 

U12 

* 

* 

* 

* 

U12 

d2 

d3 

u23 

u13 

* 

* 

u13 

U23 

d3 

d4 

u34 

U24 

U14 

U14 

U24 

U34 

d4 

d5 

u45 

u35 

u25 

U25 

U35 

U45 

d5 

d6 

u56 

u46 

u36 

U36 

U46 

U56 

d« 

* 

* 

* 

* 

* 

• 

* 

* 

• 

* 

* 

* 

(b)  Nll»l  ;NDROW— N;  (d)  N11»N*M+1 ; NDROW-N : 

NDCOL-N+l;NDIAG»l  NDCOL-- (N-l) ;NDIAG“1 


Figure  2. 


Permitted  compressed  storage;  d^ 
diagonal  element;  replace  by 
L  is  stored;  NDCOL  =  WOIAG-NDROW. 


is  ith 
■.  j  ^  when 


L. 


« 
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where 

N...NDROW  are  defined  above,  except  that  A (Nil)  is  the 
(1,1)  element  of  the  factorized  matrix 

Y  is  the  right  hand  side  on  entry  and  the  solution 
on  exit. 

C.  Comments 

1.  Let  NDCOL  =  NDIAG-NDROW  be  the  distance  between  successive 
elements  in  a  row.  When  (NDCOL|  is  a  multiple  of  eight, 
performance  of  both  the  factorization  and  foward  substi¬ 
tution  step  is  severly  degraded  by  a  factor  approaching 
four.  When  |NDROW|  or  |NDIAG|  are  multiples  of  eight,  some 
degradation  will  also  be  noted  for  small  bandwidths. 

2.  The  dimension  of  Y  must  be  at  least  N  +  M  +  1. 

3.  The  storage  of  the  matrix  may  have  to  be  increased  to  assure 
that  certain  data  outside  the  normal  matrix  storage  can  be 
operated  upon  as  floating  point  numbers.  These  positions 
are  indicated  by  asterisks  in  Figure  1.  For  example,  in 
Figure  1(a),  the  solver  will  access  the  data  "above"  the 
normal  matrix  storage,  use  it  as  operands  for  floating  point 
add  and  multiply,  but  will  not  store  the  results.  In  this 
case  additional  storage  need  not  be  allocated,  since  these 
operands  will  simply  be  fetched  from  the  preceding  column. 
Only  when  this  fetched  data  represents  a  fixed  point  or  in¬ 
struction  format  can  floating  point  exceptions  be  expected. 

D.  Driver  Program 

Appendix  A  contains  a  listing  of  a  Fortran  driver  program  that 
formulates  equations  that  are  diagonally  dominant  and  that  are  stored 


so  that  neither  |NROW|,  |NDCOLj,  or  jNDIAG]  are  multiples  of  eight, 
thus  avoiding  memory  bank  conflicts. 


IV.  Performance 

Table  2  gives  the  measured  solution  times  and  execution  rates 
associated  with  solving  1024  equations  on  the  CRAY-1.  Among  the 
more  interesting  results  are  the  rates  for  solving  small-bandwidth 
cases,  in  comparison  with  the  unsymmetric  solver  of  [2]  that  has 
twice  the  average  vector  length.  For  half-bandwidths  of  8,  16,  and 
32,  the  unsymmetric  factorization  executes  at  18,  44,  and  88  MFLOPS, 
respectively.  From  Table  2  the  corresponding  rates  are  13.4,  34.0, 
and  68.9  MFLOPS.  The  asymptotic  rates  are  similar  for  both  solvers. 

It  should  be  pointed  out  that  the  timings  in  [2]  were  obtained 
on  the  COS  operating  system;  CTSS  was  used  to  produce  the  results  of 
Table  2. 
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Half- 

Bandwidth 

Factorization 

4 

.00508/5.03 

8 

.00616/13.4 

16 

.00865/34.0 

32 

.  0160/68 . 9 

64 

.0401/105. 

65 

.0554/78.7 

68 

.0559/85.1 

80 

.0687/95.0 

96 

.0894/104. 

128 

.139/117. 

129 

.164/101. 

132 

.165/105. 

144 

.189/108. 

160 

.218/115. 

196 

. 328/113. 

197 

. 328/114. 

200 

.335/115. 

Table  2  •  Execution  time  (sec)  and  rate  (MFLOPS)  to 


Substitution 
.00122/14.2 
.00122/27.6 
.00122/54.1 
.00201/64.7 
.00323/78.9 
.00406/63.7 
.00449/60.2 
.00450/70.1 
.  00468/80.2 
.00552/89.2 
.00624/79.4 
.  00667/75.9 
.00674/81.4 
.  00686/88.1 
. 00886/82.0 
. 00886/82.3 
.00886/83.5 

solve  1024  equations 
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Appendix  A 

Listing  of  Driver  Program 
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